covering number
Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
Existing Rademacher complexity bounds for neural networks rely only on norm control of the weight matrices and depend exponentially on depth via a product of the matrix norms. Lower bounds show that this exponential dependence on depth is unavoidable when no additional properties of the training data are considered. We suspect that this conundrum comes from the fact that these bounds depend on the training data only through the margin. In practice, many data-dependent techniques such as Batchnorm improve the generalization performance. For feedforward neural nets as well as RNNs, we obtain tighter Rademacher complexity bounds by considering additional data-dependent properties of the network: the norms of the hidden layers of the network, and the norms of the Jacobians of each layer with respect to all previous layers. Our bounds scale polynomially in depth when these empirical quantities are small, as is usually the case in practice. To obtain these bounds, we develop general tools for augmenting a sequence of functions to make their composition Lipschitz and then covering the augmented functions. Inspired by our theory, we directly regularize the network's Jacobians during training and empirically demonstrate that this improves test performance.
On Measuring Excess Capacity in Neural Networks
We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime?
Supplementary Material for " Non-Asymptotic Error Bounds for Bidirectional GANs "
Department of Mathematics, The Hong Kong University of Science and Technology Clear Water Bay, Hong Kong, China yyangdc@connect.ust.hk In this supplementary material, we first prove Theorem 3.2, and then Theorems 3.1 and 3.3. We use ฯ to denote the ReLU activation function in neural networks, which is ฯ (x) = max {x, 0}. We use notation O () and O () to express the order of function slightly differently, where O () omits the universal constant not relying on d while O () omits the constant related to d . So far, most of the related works assume that the target distribution ยต is supported on a compact set, for example Chen et al. (2020) and Liang (2020).
Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces
Roknilamouki, Amirhossein, Ghosh, Arnob, Shi, Ming, Nourzad, Fatemeh, Ekici, Eylem, Shroff, Ness B.
In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}{\tau}\bigr) \sqrt{\log(\tfrac{1}{\tau}) d^3 H^4 K} \bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective Constraint-Decomposition (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
Bridging the Gap: Rademacher Complexity in Robust and Standard Generalization
Xiao, Jiancong, Sun, Ruoyu, Long, Qi, Su, Weijie J.
Training Deep Neural Networks (DNNs) with adversarial examples often results in poor generalization to test-time adversarial data. This paper investigates this issue, known as adversarially robust generalization, through the lens of Rademacher complexity. Building upon the studies by Khim and Loh (2018); Yin et al. (2019), numerous works have been dedicated to this problem, yet achieving a satisfactory bound remains an elusive goal. Existing works on DNNs either apply to a surrogate loss instead of the robust loss or yield bounds that are notably looser compared to their standard counterparts. In the latter case, the bounds have a higher dependency on the width $m$ of the DNNs or the dimension $d$ of the data, with an extra factor of at least $\mathcal{O}(\sqrt{m})$ or $\mathcal{O}(\sqrt{d})$. This paper presents upper bounds for adversarial Rademacher complexity of DNNs that match the best-known upper bounds in standard settings, as established in the work of Bartlett et al. (2017), with the dependency on width and dimension being $\mathcal{O}(\ln(dm))$. The central challenge addressed is calculating the covering number of adversarial function classes. We aim to construct a new cover that possesses two properties: 1) compatibility with adversarial examples, and 2) precision comparable to covers used in standard settings. To this end, we introduce a new variant of covering number called the \emph{uniform covering number}, specifically designed and proven to reconcile these two properties. Consequently, our method effectively bridges the gap between Rademacher complexity in robust and standard generalization.
Sequence Length Independent Norm-Based Generalization Bounds for Transformers
Since Vaswani et al. (2017) debuted the Transformer, it has become one of the most preeminent architectures of its time. It has achieved state of the art prediction capabilities in various fields (Dosovitskiy et al., 2020; Wu et al., 2022; Vaswani et al., 2017; Pettersson and Falkman, 2023) and an implementation of it has even passed the BAR exam (Katz et al., 2023). With such widespread use, the theoretical underpinnings of this architecture are of great interest. Specifically, this paper is concerned with bounding the generalization error when using the Transformer in supervised learning. Upper bounding this can be used to help understand how sample size needs to scale with different architecture parameters and is a very common theoretical tool to understand machine learning algorithms (Kakade et al., 2008; Garg et al., 2020; Truong, 2022; Lin and Zhang, 2019).
Approximate Description Length, Covering Numbers, and VC Dimension
Daniely, Amit, Katzhendler, Gal
Neural Networks are a widely used tool nowadays, despite the lack of theoretical background supporting their abilities to generalize well. Classical notions of learning guarantee generalization only if there are more examples that parameters. It is clear that a stronger assumption is needed to achieve tighter bounds, and indeed, different types of assumptions were used in order to fill this empirical-theoretical gap, including assumptions on robustness to noise [2], bias of the learning algorithm [5, 10], and norm bounds on the weight's matrices [8, 9] The idea of Approximate Description Length [4] was conceived as a part of the line of research working under assumptions that bound the magnitude of the network's weight matrices.